Context

The below project is the result of scraping Illinois Department of Agriculture FY19 Apiary Annual Report. This report is in a PDF file format and does not lend itself to easy data extraction. Most extraction from PDFs is still done manually. The goal of this project is to extract data from page 5 of the report, manipulate the data, visualize the data, and export to an excel file.

Libraries

Tabula: Will be used to extract the data from PDF format.

Pandas: Will be used to clean the data, manipulate the data, and export to an excel format.

Plotly: Will be used to create visualizations of the data.

Initial extraction with Tabula

Extracting the table form Page 5 of the annual report provided imperfect results. With a little of adjustment of the coordinates that Tabula uses to extract teh data did provide marginally better results. Ultimately, all of the information from the table was extracted with the addition of a couple of empty rows. We can see this result below.

Cleaning the data frame

In order to get a workable data frame we will be deleting the first 3 rows of the extracted table as well as the last 2 columns. We will recreate the column headers with simplified names that will ease our later work. The last 2 columns are unnecessary for our purposes as they are derived values.

Also, the extracted data frame is of the object data type with a ',' in the numerical values. We will be deleting the ',' from the numerical items and converting them to a numeric data type.

Renaming column headers

Removing ',' and converting to numeric data type

Flipping data frame

We will be calculating the percent change from the previous year using Pandas .pct_change method. In order to do this we will need to first flip the data from so our year value is represented from minimum to maximum.

Data visualization

Below we visualize our data using plotly express. We have created a line graph showing the change in total keepers, colonies, and apiaries from 2001 to 2019.

We also have visualized the percent change over time for our three variables. Here we can see a sharp decline in the number of colonies from 2005 to 2006 with a -27% change year over year. In that same time period the number of keepers grew by 4% and apiaries by 5%. This raises teh question of what caused such a sharp decline in the number of colonies during this time period?

Percent Change YOY